Unsupervised learning of natural languages.
Abstract
We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.
Similar resources
Efficient, Correct, Unsupervised Learning for Context-Sensitive Languages
A central problem for NLP is grammar induction: the development of unsupervised learning algorithms for syntax. In this paper we present a lattice-theoretic representation for natural language syntax, called Distributional Lattice Grammars. These representations are objective or empiricist, based on a generalisation of distributional learning, and are capable of representing all regular languag...
An algorithm for the unsupervised learning of morphology
This paper describes in detail an algorithm for the unsupervised learning of natural language morphology, with emphasis on challenges that are encountered in languages typologically similar to European languages. It utilizes the Minimum Description Length analysis described in Goldsmith 2001 and has been implemented in software that is available for downloading and testing. 1. Scope of this pap...
Two Approaches for Building an Unsupervised Dependency Parser and Their Other Applications
Much work has been done on building a parser for natural languages, but most of this work has concentrated on supervised parsing. Unsupervised parsing is a less explored area, and an unsupervised dependency parser has hardly been tried. In this paper we present two approaches for building an unsupervised dependency parser. One approach is based on learning dependency relations and the other on lea...
Unsupervised Language Acquisition: Theory and Practice
In this thesis I present various algorithms for the unsupervised machine learning of aspects of natural languages using a variety of statistical models. The scientific object of the work is to examine the validity of the so-called Argument from the Poverty of the Stimulus advanced in favour of the proposition that humans have language-specific innate knowledge. I start by examining an a priori ...
Modeling Acquisition of Word Structure with Lexicalized Grammar Learning
This paper introduces a framework for learning structure in natural languages, and reports results from a simple application of it to learning word-syntax of an agglutinative language in an unsupervised manner. Arguably, the learning environment of children acquiring languages provides more information—by means of linguistic interaction and extralinguistic information present in the learning se...
Journal:
- Proceedings of the National Academy of Sciences of the United States of America
Volume 102, Issue 33
Pages: -
Publication date: 2005